Assignment 1

Using linkscape to adjust visualizations obtained by file tree1.pdf. Here is our result:
Updated tree

Updated tree

Assignment 2

2.1 Read senic data

Here is a quick overview of our data:

head(senic)
##   V1    V2   V3  V4   V5    V6  V7 V8 V9 V10 V11 V12
## 1  1  7.13 55.7 4.1  9.0  39.6 279  2  4 207 241  60
## 2  2  8.82 58.2 1.6  3.8  51.7  80  2  2  51  52  40
## 3  3  8.34 56.9 2.7  8.1  74.0 107  2  3  82  54  20
## 4  4  8.95 53.7 5.6 18.9 122.8 147  2  4  53 148  40
## 5  5 11.20 56.5 5.7 34.5  88.9 180  2  1 134 151  40
## 6  6  9.76 50.9 5.1 21.9  97.0 150  2  2 147 106  40
dim(senic)
## [1] 113  12

2.2 Detect outlier function

We create a function call detect_oulier, which receive a column of SENIC data and do the following:
1. Compute the first and third quantiles Q1 and Q3
2. Return the indices of outlying observations

Here is an example with Infection risk (column 4 of SENIC data)

# X is the input vector
detect_outlier <- function(X){
  quan <- quantile(X, c(0.25, 0.75))    #Q1 and Q3
  upper <- quan[2] + 1.5*(quan[2]-quan[1])
  bottom <- quan[1] - 1.5*(quan[2]-quan[1])
  idx <- which((X>upper)|(X<bottom))
  return(idx)   # return the positions of outliers
}
detect_outlier(senic[,4])
## [1]  13  40  53  54  93 107

2.3 Density plot of Infection risk

Following the requirements, we draw a density plot of Infection risk and pinpoint the outliers with a diamond shape.
The shape of the result is quite similar with normal distribution (with a litter bit asymmetric). The major inflection risk rate is approximately between 1% and 7% for our sample.

2.4 Graphs for all other quantitative variables

We draw the graphs for all quantitative variables as figure 2.3. We ignore the feature with ‘ID’, ‘Medical School Affiliation’ and ‘Region’, since ‘ID’ is an unnecessary feature and ‘Medical School Affiliation’ and ‘Region’ are not features with great range.

On the one hand, the plots of “Length of Stay”, “RC Ratio”, “Beds”, “ADC”, “Nurses” are asymmetric. It is understandable because, for example, most of hospitals cannot provide too many nurses and beds, thereby satisfying numerous patients.

On the other hand, the graph of “Age”, “Facilities & Services”, “RCX Ratio”, “Infection Risk” are symmetric as we expect. For example, to maintain the medical ability, the facilities and services of all hospitals are approximately similar. Only minority of hospitals can provide fewer or more compared with the others.

2.5 Scatter plot of Infection risk on the Number of Nurses (colored by Number of Beds)

We notice that a higher number of beds does not lead to a reduction in the infection-risk rate. It means that we cannot ensure the dominant importance of number of beds to the infection-risk reduction. The huge number of the beds might represent that the infection can easily happen in the hospital. Some other reasons should be taken into the consideration as well,i.g. Length of Stay.

2.6 Convert graph from step 3 to Plotly with ggplotly function

A new functionality that graph from step 3 does not have is we can see the exact value of each point when moving the mouse to the point of graph 2.6. Additionally, we can zoom in and out to focus some exact parts by graph 2.6.

2.7 Use plotly directly to draw histogram of Infection risk

2.8 Repeat step 4 with shiny app

library(shiny)
ui <- fluidPage(
  titlePanel("Question 2.8 shiny-plot"),
  sidebarLayout(
    sidebarPanel(
      checkboxGroupInput(inputId = "select",
                        label = "Select Variables:",
                        choices = colnms),
      sliderInput(inputId = "chg",
                  label = "Choose Bandwidth: ",
                  min = 0.1,
                  max = 50,
                  value = 1)
    ),
    mainPanel(plotOutput("plots"))
  )
)
server <- function(input, output) {
  output$plots <- renderPlot({
    validate(
      need(input$select, "Please select")
    )
    
    plots <- lapply(input$select, function(cname0)
      {
        cname <- cols[colnms==cname0]
        idx <- detect_outlier(senic[,cname])
        x.axis <- senic[idx,cname]
        y.axis <- rep(0,length(x.axis))
        ggplot(data=senic, aes_string(x=cname))+
          geom_density(fill="cornsilk", colour="grey40", size=.2, bw=input$chg)+
          geom_point(data=as.data.frame(x.axis), aes(x=x.axis,y=y.axis), shape=5, size=3)+
          scale_x_continuous(name = colnms[which(cols==cname)])
      }
    )
    plot(arrangeGrob(grobs=plots))
  })
}
shinyApp(ui = ui, server = server)

When varying bandwidth decreasingly, the graph changes from very smooth (bandwidth =10) to quite close to precise data (bandwith=1).

From our perspective, the value of bankwidth should depend on the sample data, the larger variance among the data, the rougher density figure we get using the same bandwidth. So in particular for ‘Length of Stay’, ‘Age’ and ‘Infection risk’ the bankwidth around 1 is enough. In particular for ‘RC Ratio’, ‘RCX Ratio’ and ‘Facilities & Serives’ is around 5. Finally for ‘Beds’, ‘Nurses’ and ‘ADC’ is approximately 50.

Appendix

library(ggplot2)
library(gridExtra)
library(plotly)
library(shiny)

#2.1
#file_path <- "C:/Users/Duong Minh Duc/Desktop/SENIC.txt"
file_path <- "C:/Users/fengy/Desktop/Master/732A98 Visualization/lab1/SENIC.txt"
senic <- read.csv(file = file_path , sep = "", header = FALSE)
rm(file_path)

#2.2
detect_outlier <- function(X){
  quan <- quantile(X, c(0.25, 0.75))    #Q1 and Q3
  upper <- quan[2] + 1.5*(quan[2]-quan[1])
  bottom <- quan[1] - 1.5*(quan[2]-quan[1])
  idx <- which((X>upper)|(X<bottom))
  return(idx)   # return the positions of outliers
}
 
#2.3
outlier <- detect_outlier(senic[,4])
value <- senic$V4[outlier]
p2.3 <- ggplot(senic, aes(x=senic$V4))+
  geom_density(fill="cornsilk", colour="grey60", size=.3) +
  geom_point(data=as.data.frame(outlier), aes(x=value, y=0), shape=5, size=3) +
  ggtitle("Inflection-risk density plot") +
  scale_x_continuous(name = "Inflection risk")
p2.3
 
#2.4
colnms <- c("Length of Stay","Age","Infection Risk",
            "RC Ratio","RCX Ratio","Beds","ADC","Nurses",
            "Facilities & Services")
cols <- names(senic[,c(2,3,4,5,6,7,10,11,12)])
f <- function(cname)
{
  idx <- detect_outlier(senic[,cname])
  x.axis <- senic[idx,cname]
  y.axis <- rep(0,length(x.axis))
  ggplot(data=senic, aes_string(x=cname))+
    geom_density(fill="cornsilk", color="grey40", size=.3)+
    geom_point(data=as.data.frame(x.axis), aes(x=x.axis,y=y.axis), shape=5, size=3)+
    scale_x_continuous(name = colnms[which(cols==cname)])
}
plot(arrangeGrob(grobs=lapply(cols, f)))

#2.5
ggplot(senic, aes(x=senic$V4, y=senic$V11,colour =senic$V7)) +
    geom_point() +
    ggtitle("Dependence of Infection risk on the Number of Nurses") +
    scale_x_continuous(name ="Inflextion risk") +
    scale_y_continuous(name = "Num Nurses") +
    scale_colour_continuous(name = "Num Beds")

#2.6
p2.6 <- ggplotly(p2.3)
p2.6

#2.7
p2.7 <- plot_ly(x = ~senic$V4, type = "histogram") 
add_trace(p2.7, x= ~value, y=~0 , type = "scatter", symbol= I(5), mode="markers", showlegend=FALSE) %>%
layout( title="Histogram plot of Inflection risk", xaxis = list(title = "Inflection risk"), yaxis=list(title= "Number"))

#2.8
ui <- fluidPage(
  titlePanel("Question 2.8 shiny-plot"),
  sidebarLayout(
    sidebarPanel(
      checkboxGroupInput(inputId = "select",
                        label = "Select Variables:",
                        choices = colnms),
      sliderInput(inputId = "chg",
                  label = "Choose Bandwidth: ",
                  min = 0.1,
                  max = 50,
                  value = 1)
    ),
    mainPanel(plotOutput("plots"))
  )
)
server <- function(input, output) {
  output$plots <- renderPlot({
    validate(
      need(input$select, "Please select")
    )
    plots <- lapply(input$select, function(cname0)
      {
        cname <- cols[colnms==cname0]
        idx <- detect_outlier(senic[,cname])
        x.axis <- senic[idx,cname]
        y.axis <- rep(0,length(x.axis))
        ggplot(data=senic, aes_string(x=cname))+
          geom_density(fill="cornsilk", colour="grey40", size=.2, bw=input$chg)+
          geom_point(data=as.data.frame(x.axis), aes(x=x.axis,y=y.axis), shape=5, size=3)+
          scale_x_continuous(name = colnms[which(cols==cname)])
      }
    )
    plot(arrangeGrob(grobs=plots))
  })
}
shinyApp(ui = ui, server = server)